IRT
F. Heidary
Item response theory (IRT) rests on two basic postulates: (a) the performance of an examinee on a test item can be predicted by a set of factors called traits, latent traits, or abilities; and (b) the relationship between an examinee's item performance and the set of traits underlying it can be described by a function called an item characteristic function or item characteristic curve (ICC). This function specifies that as the level of the trait increases, the probability of a correct response to an item increases.

IRT models show the relationship between the ability or trait (symbolized θ) measured by the instrument and an item response. The item response may be dichotomous (two categories), such as right or wrong, yes or no, agree or disagree. Or it may be polytomous (more than two categories), such as a rating from a judge or scorer. The construct measured by the items may be an academic proficiency or aptitude, or it may be an attitude or belief.

What do we use IRT for? One of the basic reasons is to score tests or surveys. The IRT score is often called an ability, trait, or proficiency. IRT scoring takes into account item difficulty and discrimination. Items that are more discriminating or more reliable are weighted more heavily, so IRT scores can be more reliable than number-correct scores. If different examinees take different tests, the IRT scores adjust for the difference in difficulty. This makes computer adaptive testing (CAT) possible. In CAT, the test items are selected to match each examinee's proficiency, so that the examinee will not be bored by easy items or frustrated by overly difficult items. IRT scoring puts the scores from the different test forms onto the same metric, so that each examinee can have a customized test form. IRT also provides an index of the precision of the test score (the standard error of measurement) for each examinee.
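The ideas above (an item characteristic function that rises with the trait, and a standard error of measurement that differs from examinee to examinee) can be illustrated with a minimal sketch. It assumes the two-parameter logistic (2PL) model, which the text does not name; the function names and the item parameters `a` (discrimination) and `b` (difficulty) are hypothetical values chosen only for illustration, and `theta` stands for the trait level.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model (an assumed model).

    theta: examinee trait level; a: item discrimination; b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of one 2PL item at trait level theta."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def standard_error(theta, items):
    """Standard error of measurement at theta: 1 / sqrt(sum of item information)."""
    total_info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(total_info)

# Hypothetical (a, b) pairs for a three-item test; not taken from any real test.
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]

for theta in (-2.0, 0.0, 2.0):
    probs = [round(icc_2pl(theta, a, b), 3) for a, b in items]
    print(theta, probs, round(standard_error(theta, items), 3))
```

In this sketch, items with larger `a` contribute more information near their difficulty `b`, which is one way to see why more discriminating items carry more weight in the score and why the standard error depends on where the examinee falls on the trait.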
Additionally, IRT can be used in test or scale development. An IRT analysis supplies indices of item difficulty and discrimination. Knowing the item difficulty is useful when building tests to match the trait levels of a target population. Discrimination is useful for selecting items that differentiate well between examinees with low and high levels of the proficiency or attitude measured by the test items. Together, difficulty and discrimination can be used to calculate the standard error of measurement or the reliability of the scores. These basic indices have analogs in classical test theory (CTT), but the IRT indices are more readily understood in the context of the formal mathematical models that describe the item response probabilities.

IRT makes three assumptions: unidimensionality, local independence, and correct model specification.

Unidimensionality

A unidimensional test consists of items that tap into only one dimension. The assumption of unidimensionality means that a set of items or a test measures only one latent trait (θ). Whenever only a single score is reported for a test, there is an implicit assumption that the items share a common primary construct. Multidimensional IRT models exist, but they are not addressed here. Unidimensionality means that the model has a single θ for each examinee, and any other factors affecting the item response are treated as random error or as nuisance dimensions unique to that item and not shared by other items. Violating this assumption may lead to misestimation of parameters or standard errors.

One caution concerning unidimensionality: sometimes test responses can be mathematically unidimensional even when the items measure what psychologists or educators would conceptualize as two different constructs. For example, test items may measure both knowledge and test-taking motivation.
Or items on a science test may measure both reading and science. Or items may measure both test-taking speed and knowledge. If all items measure both constructs in the same relative proportion, then mathematically they will measure a single θ that is a hybrid of both constructs (Reckase, Ackerman, & Carlson, 1988). Further, if examinees do not vary on one of the constructs, all individual differences will be due to the other construct, and the responses will be mathematically unidimensional.

For the test responses to be multidimensional, different items have to tap into different combinations of the constructs, and examinees have to vary on both constructs. In the knowledge and motivation example, suppose some items need a lot of motivation but only moderate knowledge (they take some perseverance to work through, but are not difficult if the examinee takes the time to do the work), while other items take little motivation but a high level of knowledge (an examinee either knows the answer or does not). The test will then likely not be unidimensional, as long as examinees vary on both motivation and knowledge. If all examinees are sufficiently motivated, motivation will no longer be a factor, and the test responses will be unidimensional (assuming the knowledge measured is unidimensional).

As another example, when a test has restrictive time limits even though its intended purpose is to measure content knowledge, the items at the end of the test may measure test-taking speed more than the items at the beginning do, creating multidimensionality. If the test were administered to a group of examinees who were all relatively quick at responding to test items, the test would be more unidimensional. In short, dimensionality may be context- and sample-dependent. Many methods have been proposed for testing unidimensionality.

Local Independence

Another assumption of IRT is local independence.
Local independence is the assumption that there is no statistical relationship between examinees' responses to pairs of items in a test once the primary trait measured by the test is controlled for. If the item responses are not locally independent under a unidimensional model, another dimension must be causing the dependence. Unlike tests of unidimensionality, however, tests of local independence focus on dependencies among pairs of items. These dependencies might not emerge as separate dimensions unless they influence a larger group of items, and thus might not be detectable by tests of unidimensionality. Consequently, separate procedures have been developed to detect local dependencies. If items are locally independent, they will be uncorrelated after conditioning on θ. Again, note that the items can (and should) be correlated in the sample as a whole. It is only after controlling for θ that we assume they are uncorrelated.

Fit

The third assumption, correct model specification, concerns how well the model describes the relationship between the trait measured by the test and the item responses. The fit between the model and the data can be assessed to check for model misspecification. If the item characteristic function is not monotonically increasing, none of the common models will fit. Typically, IRT practitioners focus on the fit of individual items, not the overall fit of the model across all items.

References

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. Fundamentals of Item Response Theory. Sage Publications.
DeMars, C. Item Response Theory. Oxford University Press.